We first need to load the data set I found on ‘data.world’ at [website link].
## Getting my Data Set
## Loading the libraries used throughout this analysis
library(dplyr)
library(ggplot2)
library(readr)
library(scales)
library(usmap)
library(geosphere)
library(TSP)
data <- read.csv("~/Library/CloudStorage/GoogleDrive-ccoonce@asu.edu/My Drive/DAT 301/ProjectModule4/inform8n-us-national-parks-visitation-1904-2016-with-boundaries/data/all_national_parks_visitation_1904_2016.csv")
First, I would like to understand the structure of this data set.
## Exploring my Data Set
head(data, 3)
summary(data)
## created_by measure_selector year date_edit
## Length:21560 Length:21560 Length:21560 Length:21560
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## scrapeurl gis_notes gnis_id geometry
## Length:21560 Length:21560 Length:21560 Length:21560
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## metadata number_of_records parkname region
## Length:21560 Length:21560 Length:21560 Length:21560
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## state unit_code unit_name unit_type
## Length:21560 Length:21560 Length:21560 Length:21560
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## visitors yearraw
## Min. : 0 Length:21560
## 1st Qu.: 39125 Class :character
## Median : 155219 Mode :character
## Mean : 1277105
## 3rd Qu.: 608144
## Max. :871922828
## NA's :4
This set has 18 columns: 17 ‘character’ variables and one integer variable (‘visitors’). The columns contain information like park name, region, state, type of park, and number of visitors. Each observation (row) is one park in one year.
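A quick sanity check on those column types (a minimal sketch, assuming `data` is loaded as above):

```r
## Tally the class of each column; we expect 17 character columns and 1 numeric
table(sapply(data, class))
```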
After seeing the structure of this set, I think the first question I would like to answer is which parks I should plan to visit based solely on popularity by number of visitors.
First, let’s clean things up by removing columns unnecessary for my analysis.
## Select fewer columns
data_clean <- select(data, region:yearraw)
## Check for missing or problem values
year_list <- data_clean %>% count(yearraw)
head(year_list,4)
tail(year_list,2)
sum(is.na(data_clean$visitors))
## [1] 4
After looking at the smaller subset of data, the ‘yearraw’ column has ‘Total’ rows that will cause issues during analysis, so I will remove them and convert the column from character to numeric. In addition, the ‘visitors’ column has four NA values; I will remove those observations to streamline my exploration.
## Removing observations with NA's
data_clean <- na.omit(data_clean)
## Subset the data
data_totals <- subset(data_clean, yearraw == "Total")
data_clean <- subset(data_clean, yearraw != "Total")
data_clean$yearraw <- as.numeric(data_clean$yearraw)
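Before moving on, a quick check that the cleaning worked (a minimal sketch, assuming the objects above):

```r
## No NA visitor counts should remain, and yearraw should now be numeric
stopifnot(!any(is.na(data_clean$visitors)),
          is.numeric(data_clean$yearraw))
range(data_clean$yearraw)  # should span roughly 1904-2016
```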
I want to understand the trend in visitation across all national parks over the years before sub-setting to smaller groups. This will help me compare the national trend to the trends of different areas.
sum_by_year <- aggregate(visitors ~ yearraw, data_clean, sum)
p1 <- ggplot(data = sum_by_year, aes(x = yearraw, y = visitors)) +
geom_line() +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
theme_minimal() +
labs(title = "Total National Park Visitation per Year (1904-2016)",
x = "Year", y = "Total Visitors")
p1
This plot shows how visitor numbers have changed over time. It appears there is an overall increasing trend in the number of visitors, which could be attributed to a variety of factors that would require a deeper dive.
The data set is subdivided into regions, as shown in the infographic below.
I am only interested in regions near me, so I will look at my own region as well as the surrounding ones. I need to group and filter my data set to get the appropriate observations.
## group and filter by region
region_total <- data_clean %>%
filter(region %in% c("PW", "IM", "MW")) %>%
group_by(yearraw, region) %>%
summarise(Total = sum(visitors), .groups = "drop") %>%
na.omit()
## Line Plot by Selected Regions
p2 <- ggplot(region_total, aes(x = yearraw, y = Total, group = region, color = region)) +
geom_line() +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
theme_minimal() +
labs(title = "Total National Park Visitation per Year (1904-2016)",
x = "Year", y = "Total Visitors") +
facet_wrap(~ region) +
theme(axis.text.x = element_blank())
p2
These plots show that Intermountain and Pacific West regions have seen the steepest increases in visits. I will focus my attention on these two regions.
In order to show total visitation by state I need to filter and group my data set.
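The chunk that builds `state_data` is not shown above, even though it is used in the merge below. A minimal sketch of what it likely contains, assuming total visitors are summed per state from `data_clean`:

```r
## Hypothetical reconstruction: total visitation by state
state_data <- data_clean %>%
  group_by(state) %>%
  summarise(Total = sum(visitors), .groups = "drop")
```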
Next, I need to add location data in order to create a heat map of my new data. I can use the ‘usmap’ package to create an interesting plot.
# Get the states map data from usmap
states_map <- usmap::us_map(regions = "states") %>%
filter(abbr %in% c("WA", "OR", "CA", "MT", "WY", "CO", "UT", "NV", "ND", "SD", "NE", "ID", "AZ", "NM"))
# Merge visitor data with map data
merged_data <- merge(states_map, state_data, by.x = "abbr", by.y = "state", all.x = TRUE)
Now I can prepare my base layer for adding multiple data visualizations.
# Plot the outlines of the states with heatmap
base_map <- ggplot() +
geom_polygon(data = merged_data, aes(x = x, y = y, group = group), fill = "white", color = "black", linewidth = 0.25)
p3 <- ggplot() +
geom_polygon(data = merged_data, aes(x = x, y = y, group = group, fill = Total), color = "black") +
scale_fill_continuous(labels = label_number(scale_cut = cut_short_scale()), low = "white", high = "red", name = "Total Visits", na.value = NA) +
theme_void() + # Removes axis labels and background for a cleaner look
labs(title = "Heatmap of Total National Park Visits by Selected Western US States")
p3
This heatmap shows which states have the most visitors. Based on the results, I will pick the top parks in California, Wyoming, and Washington. I’d like to start thinking about weighing the distance from my home and the number of visitors to see if it can help me decide.
I will need to find another data set with latitude and longitude values for the top parks, and add that information to the top-parks data I start preparing below.
top_parks <- data_clean %>%
group_by(unit_name, state) %>%
summarise(Total = sum(visitors), .groups = "drop") %>%
arrange(desc(Total)) %>%
filter(state %in% c("WA", "OR", "CA", "MT", "WY", "CO", "UT", "NV", "ND", "SD", "NE", "ID", "AZ", "NM"))
I was able to compile the coordinates for a number of the most visited parks. I need to create a data frame before I can merge my top parks data with these locations.
## Creating a new data frame from the coordinates I compiled
geo_data <- read_csv("geo_data.txt", col_types = cols(Latitude = col_number(),
Longitude = col_number()))
## adding coordinates to my list of top parks
top_parks <- merge(top_parks, geo_data,
by.x = "unit_name", by.y = "Park Name")
## rearranging columns to make manipulation easier
top_parks <- select(top_parks, unit_name, state, Total, Longitude, Latitude)
After merging, I now need to convert the longitude and latitude to the ‘usmap’ coordinate system in order to map locations onto my base layer.
## 'usmap' function to transform Latitude and Longitude to 'usmap' coordinates.
transformed_top_parks <- usmap_transform(top_parks, input_names = c("Longitude", "Latitude"))
Now I can create a plot to summarize my findings.
p4 <- base_map +
geom_point(data = transformed_top_parks, aes(x = x, y = y, size = Total), alpha = 0.7, color = "blue") +
scale_size_area(max_size = 10, name = "Total Visitors") +
labs(title = "Bubble Plot of National Parks in Western US States", x = NULL, y = NULL) +
theme_minimal() +
theme(legend.position = "right",
text = element_text(size = 12),
title = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank()) +
coord_quickmap()
p4
Now that I have a better idea of where specific top parks are I can start narrowing my search. I think I should consider distance from my home.
I will start by loading and converting my hometown coordinates to the ‘usmap’ coordinate system.
## Coordinates for Helena, Montana
helena_coords <- data.frame(Longitude = -112.0372, Latitude = 46.5891)
helena_coords_transformed <- usmap_transform(helena_coords, input_names = c("Longitude", "Latitude"))
Now I need the Haversine formula, which computes the shortest great-circle distance between two points on the Earth from their latitude and longitude coordinates.
## Calculate distances from Helena to each park
transformed_top_parks$distance <- distHaversine(matrix(c(helena_coords$Longitude,
helena_coords$Latitude), nrow = 1),
matrix(c(transformed_top_parks$Longitude,
transformed_top_parks$Latitude),
ncol = 2), r=6378137)
## distHaversine returns meters; convert meters to miles
transformed_top_parks$distance_miles <- transformed_top_parks$distance / 1609.34
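For reference, the Haversine distance between points $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ on a sphere of radius $r$ is:

```latex
d = 2r \arcsin\!\left(\sqrt{\sin^2\!\left(\tfrac{\varphi_2 - \varphi_1}{2}\right)
  + \cos\varphi_1 \cos\varphi_2 \,\sin^2\!\left(\tfrac{\lambda_2 - \lambda_1}{2}\right)}\right)
```

with $r = 6378137$ m, the equatorial radius passed to `distHaversine` above.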
In order to plot this information I want to add my hometown coordinates to the top parks data frame.
## Create a data frame that repeats Helena's coordinates for each park
helena_repeated <- data.frame(
x = rep(helena_coords_transformed$x, nrow(transformed_top_parks)),
y = rep(helena_coords_transformed$y, nrow(transformed_top_parks))
)
## Add Helena's coordinates to the 'transformed_top_parks' data frame
transformed_top_parks$helena_x <- helena_repeated$x
transformed_top_parks$helena_y <- helena_repeated$y
And finally, I will create the plot.
## Creating A second Data Frame for future plot
top_parks_cluster <- transformed_top_parks
## Filter to only the top parks by visitation
transformed_top_parks <- arrange(transformed_top_parks, desc(Total)) %>%
filter(Total >= 60000000)
## Now plot using the 'transformed_top_parks' dataframe
p6 <- base_map +
geom_segment(data = transformed_top_parks,
aes(x = helena_x, y = helena_y, xend = x,
yend = y, color = distance_miles),
arrow = arrow(length = unit(0.1, "inches"))) +
scale_color_viridis_b(name = "Distance (miles)") +
coord_quickmap() +
labs(title = "Distance from Helena, MT", x = NULL, y = NULL) +
theme(legend.position = "right",
text = element_text(size = 12),
title = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.background = element_rect(fill = "white", color = "white"))
p6
After seeing this visualization, I want to pivot to a new problem: how can I optimize seeing multiple parks per trip rather than trying to choose just one? There are too many nearby! I will use a clustering model to group the parks by their distances from each other. To do this I need a hierarchical clustering function.
## Perform a hierarchical clustering
## Creating distance matrix
dist_matrix <- dist(top_parks_cluster[,c('Longitude', 'Latitude')])
clusters <- hclust(dist_matrix)
## Split into clusters of parks
park_clusters <- cutree(clusters, k = 5)
top_parks_cluster$cluster <- park_clusters
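The choice of k = 5 is a judgment call; a dendrogram makes it easy to eyeball other cuts (a minimal sketch using base R, assuming the objects above):

```r
## Plot the dendrogram and outline the 5-cluster cut
plot(clusters, labels = top_parks_cluster$unit_name, cex = 0.6)
rect.hclust(clusters, k = 5, border = "red")
```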
With my new set of clustered parks, I want to see how the clusters worked out.
p7 <- base_map +
geom_point(data = top_parks_cluster, aes(x = x, y = y, color = factor(cluster)), size = 2, alpha = 0.9) +
scale_color_brewer(palette = "Set1") + # Use a qualitative color palette
labs(title = "Cluster Results", color = "Cluster", x = NULL, y = NULL) +
theme_minimal() +
coord_quickmap() +
theme(legend.position = "right",
text = element_text(size = 12),
title = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.background = element_rect(fill = "white", color = "white"))
p7
Not a bad result, but there are a few parks I want to reassign to a different cluster, and a few I will drop entirely.
top_parks_cluster$cluster[12] <- 1
top_parks_cluster$cluster[33] <- 1
top_parks_cluster <- top_parks_cluster[-21, ]
top_parks_cluster <- top_parks_cluster[-6, ]
top_parks_cluster <- top_parks_cluster[-3, ]
p7.1 <- base_map +
geom_point(data = top_parks_cluster, aes(x = x, y = y, color = factor(cluster)), size = 2, alpha = 0.9) +
scale_color_brewer(palette = "Set1") + # Use a qualitative color palette
labs(title = "Adjusted Cluster Results", color = "Cluster", x = NULL, y = NULL) +
theme_minimal() +
theme(legend.position = "right",
text = element_text(size = 12),
title = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.background = element_rect(fill = "white", color = "white"))
p7.1
Better! Now I want to know the most efficient way to drive from park to park, so I will use the Traveling Salesman Problem (TSP). The TSP finds the shortest possible route that visits each city exactly once and returns to the starting city. I would likely get a more accurate route if I used road distances rather than straight-line distances, but a basic TSP is more than enough for this situation.
## Initialize an empty list to store results
tsp_results <- list()
## Loop through each cluster
for (i in 1:max(top_parks_cluster$cluster)) {
## Subset parks for the current cluster
cluster_parks <- top_parks_cluster[top_parks_cluster$cluster == i, ]
## Calculate the distance matrix for the current cluster
dist_matrix <- as.dist(distm(cluster_parks[, c("Longitude", "Latitude")],
cluster_parks[, c("Longitude", "Latitude")],
fun = distHaversine))
## Solve the TSP using the nearest insertion heuristic
tsp_solution <- solve_TSP(TSP(dist_matrix), method = "nearest_insertion")
## Solution List
tsp_results[[paste("Cluster", i)]] <- tsp_solution
}
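To compare the clusters, the TSP package’s `tour_length()` reports each route’s total distance; since the distance matrix was built with `distHaversine`, the lengths are in meters (a quick check, assuming the objects above):

```r
## Total route length per cluster, converted from meters to miles
sapply(tsp_results, function(tour) tour_length(tour) / 1609.34)
```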
Now that I have the results, I can visualize the optimal routes. I want to plot them all on the same map, which will require a function to compile all the data points into one table.
## Function to create segments from the TSP solution
get_tsp_segments <- function(cluster_number, top_parks_cluster, tsp_results) {
## Filter the parks by the selected cluster
selected_cluster <- filter(top_parks_cluster, cluster == cluster_number)
## Retrieve the TSP order for the cluster (a TOUR object is an integer permutation)
tsp_order <- tsp_results[[paste("Cluster", cluster_number)]]
## Order the parks according to the TSP solution
ordered_cluster <- selected_cluster[as.integer(tsp_order), ]
## Close the loop by returning to the starting point
ordered_cluster <- rbind(ordered_cluster, ordered_cluster[1, ])
## Create segments for each leg of the TSP path
segments <- data.frame(cluster = cluster_number,
x = ordered_cluster$x[-nrow(ordered_cluster)],
y = ordered_cluster$y[-nrow(ordered_cluster)],
xend = ordered_cluster$x[-1],
yend = ordered_cluster$y[-1]
)
return(segments)
}
## Call the function
all_segments <- lapply(1:max(top_parks_cluster$cluster), function(cluster_num) {
get_tsp_segments(cluster_num, top_parks_cluster, tsp_results)
})
## Combine all segments into one data frame
all_segments_df <- do.call(rbind, all_segments)
## Creating plot 8
p8 <- base_map +
geom_segment(data = all_segments_df,
aes(x = x, y = y, xend = xend, yend = yend,
group = cluster, color = as.factor(cluster)),
linewidth = 0.6,
arrow = arrow(length = unit(0.2, "cm"))) +
scale_color_brewer(palette = "Set1", name = "Routes") +
geom_point(data = all_segments_df,
aes(x = x, y = y, group = cluster),
size = .25) +
labs(title = "Multi-park Trip Options", x = NULL, y = NULL) +
theme(legend.position = "right",
legend.background = element_rect(fill = "white",
color = "white"),
legend.box.background = element_rect(fill = "white"),
legend.key = element_rect(fill = "white", colour = "white"),
text = element_text(size = 12),
title = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.background = element_rect(fill = "white"))
p8